On the dependency of word length on text length. Empirical results from Russian and Bulgarian parallel texts

نویسنده

  • Emmerich Kelih
چکیده

This paper tackles two basic problems of quantitative linguistics: firstly the “word length” and secondly the text length in terms of type and token numbers. It has to be shown that these two basic properties of a text are directly related. The interrelation between word length and text length can be captured by an appropriate mathematical model; hence a law-like status of the interrelation between word length and text length can be stated. Up to now, no special attention has been paid to this particular kind of self regulation (increase of the word length with increasing text length) of the text structure, which can be explained by a general control of the information flow on the lexical level. After a theoretical explanation of the interrelation of word and text length, some empirical results are presented on the basis of Russian and Bulgarian parallel texts The Master and Margarita).

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

اثر طول کلمه و نوع حروف چاپی (فونت) بر حرکات چشم (تثبیت، جهش، واپسگرد) در حین خواندن متن آشنا و ناآشنا

The main purpose of this study was to determine the effects of word length and font type of eye movement (fixation, saccade, regression) during reading Persian familiar and unfamiliar passages. 29 female students of Shahid Beheshti University Education & Psychology Faculty participated in this study. Stimuli consisted of four different texts consisting of familiar text-lotus font, familiar-pen ...

متن کامل

Comparing k-means clusters on parallel Persian-English corpus

This paper compares clusters of aligned Persian and English texts obtained from k-means method. Text clustering has many applications in various fields of natural language processing. So far, much English documents clustering research has been accomplished. Now this question arises, are the results of them extendable to other languages? Since the goal of document clustering is grouping of docum...

متن کامل

Syntactic Complexity of Russian Unified State Exam Texts in English: A Study on Reliability and Validity

In this study we analyze texts used in Russian Unified State Exam on English language. Texts that formed small research corpora were retrieved from 2 resources: official USE database as a reference point, and popular website used by pupils for USE training “Neznaika” (https://neznaika.pro/). The size of two corpora is balanced: USE has 11934 tokens and “Neznaika” - 11918 tokens. We share Biber’...

متن کامل

Producing a Persian Text Tokenizer Corpus Focusing on Its Computational Linguistics Considerations

The main task of the tokenization is to divide the sentences of the text into its constituent units and remove punctuation marks (dots, commas, etc.). Each unit is a continuous lexical or grammatical writing chain that is an independent semantic unit. Tokenization occurs at the word level and the extracted units can be used as input to other components such as stemmer. The requirement to create...

متن کامل

استخراج پیکره‌ موازی از اسناد قابل‌مقایسه برای بهبود کیفیت ترجمه در سیستم‌های ترجمه ماشینی

Data used for training statistical machine translation method are usually prepared from three resources: parallel, non-parallel and comparable text corpora. Parallel corpora are an ideal resource for translation but due to lack of these kinds of texts, non-parallel and comparable corpora are used either for parallel text extraction. Most of existing methods for exploiting comparable corpora loo...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012